REDAC: Distributed, Asynchronous Redundancy in Shared Memory Servers
Abstract
The emergence of multi-core architectures, driven by continued technology scaling, has led to concerns about increasing soft- and hard-error rates in commodity designs. Because modern chip designs consist of multiple high-speed clock domains, conventional lockstepped redundant execution is no longer practical. Recent work suggests an asynchronous approach to redundant execution, in which processor pairs independently execute an instruction stream and treat any differences like soft errors, invoking rollback recovery. Because prior designs buffer instruction results within the out-of-order instruction window, they are limited to tightly coupled redundancy within a single chip, which limits availability and serviceability in the presence of hard errors. We propose REDAC, a set of lightweight mechanisms for distributed, asynchronous redundancy within a shared-memory multiprocessor. REDAC provides scalable buffering for unchecked state updates, permitting the distribution of redundant execution across multiple nodes of a scalable shared-memory server. The REDAC mechanisms achieve high performance by enabling speculation across common serializing instructions and mitigating the effects of input incoherence. We evaluate REDAC using cycle-accurate full-system simulation of common enterprise workloads and show that performance overheads average just 10% when compared to a non-redundant system. These results are comparable to the performance of a similarly configured lockstep design, but offer the substantial benefits of asynchronous redundancy.
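The general idea of asynchronous redundancy with rollback recovery, as summarized in the abstract, can be illustrated with a small software model. This is only a toy sketch and not the REDAC hardware mechanisms: the names `run_redundant` and `make_add_ops`, the `interval` and `fault_at` parameters, and the dictionary state with an `"acc"` key are all assumptions made for illustration. Two replicas execute the same window of operations independently, their unchecked results are compared at the end of the window, and any mismatch is treated like a soft error by discarding both results and re-executing from the last checked state.

```python
import copy


def make_add_ops(n):
    """Hypothetical workload: operation k adds k to state["acc"]."""
    return [lambda s, k=k: s.__setitem__("acc", s["acc"] + k) for k in range(n)]


def run_redundant(ops, state, interval=4, fault_at=None):
    """Toy model of redundant execution with compare-and-rollback.

    ops      -- list of functions that mutate a state dict
    interval -- number of operations executed between comparisons
    fault_at -- index of one operation after which a single transient
                bit flip is injected into replica 1 (assumption: the
                state has an integer "acc" entry)
    Returns (final checked state, number of rollbacks).
    """
    checkpoint = copy.deepcopy(state)  # last state both replicas agreed on
    committed = 0                      # operations already checked
    rollbacks = 0
    fault_pending = fault_at           # the fault is transient: fires once

    while committed < len(ops):
        window = ops[committed:committed + interval]
        # Each replica starts from the checked state and runs on its own copy,
        # so its updates stay "unchecked" until the comparison below.
        replicas = [copy.deepcopy(checkpoint), copy.deepcopy(checkpoint)]
        for r, replica_state in enumerate(replicas):
            for i, op in enumerate(window):
                op(replica_state)
                if r == 1 and fault_pending == committed + i:
                    replica_state["acc"] ^= 1  # injected soft error
                    fault_pending = None

        if replicas[0] == replicas[1]:
            # Results agree: the window becomes the new checked state.
            checkpoint = replicas[0]
            committed += len(window)
        else:
            # Results differ: discard both and re-execute the window.
            rollbacks += 1

    return checkpoint, rollbacks
```

For example, `run_redundant(make_add_ops(10), {"acc": 0}, interval=4, fault_at=5)` re-executes the second window once and still ends with the same state as a fault-free run. A larger `interval` means fewer comparisons but more unchecked state to buffer and more work lost on a rollback, which is the kind of trade-off that buffering for unchecked updates has to manage.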
Similar resources
Shared Distributed Memory : the Workspace Model
Shared Distributed Memory Systems offer uniform access to data that are distributed across servers. The Workspace model is a model of shared distributed memory. It is based on communicating processes that are both clients and servers. It supports hierarchical views of data, enhances security, and adapts to heterogeneous networks.
Simulating a Shared Register in a System that Never Stops Changing
Simulating a shared register can mask the intricacies of designing algorithms for asynchronous message-passing systems subject to crash failures, since it allows them to run algorithms designed for the simpler shared-memory model. Typically such simulations replicate the value of the register in multiple servers and require readers and writers to communicate with a majority of servers. The succ...
Storage-Efficient Shared Memory Emulation
Improvements in communication fabrics have enabled access to ever larger pools of data with decreasing access latencies, bringing large-scale memory fabrics closer to feasibility. However, with an increase in scale come new challenges. Since more systems are aggregated, maintaining a certain level of reliability requires increasing the storage redundancy, typically via data replication. The cor...
A Turn Function Scheme Realized in the Asynchronous Single-Writer/Multi-reader Shared Memory Model
We consider a set of users wishing to receive a service in an asynchronous distributed system. Such users declare their wishes and then wait to gain admittance to be served. Except for the initial transient period, at least one user must be waiting to be served, and the system should be as fair as possible for users. A procedure that ensures such a situation is called a turn function. It can be...
Distributed Symbolic Computation with DTS
We describe the design and implementation of the Distributed Threads System (DTS), a programming environment for the parallelization of irregular and highly data-dependent algorithms. DTS extends the support for fork/join parallel programming from shared memory threads to a distributed memory environment. It is currently implemented on top of PVM, adding an asynchronous RPC abstraction and tur...
Publication date: 2008